Advances in Medicine, Psychology and Public Health (AMPPH)


Original Article in Clinical Psychology 


Zipf’s law: Divergence from general content in online communities of peers affected by attention-deficit/hyperactivity disorder (ADHD) and anorexia nervosa


Cite this paper as: Tarchi L, Castellini G, Ricca V. Zipf's law: Divergence from general content in online communities of peers affected by attention-deficit/hyperactivity disorder (ADHD) and anorexia nervosa. Adv Med Psychol Public Health. 2024;1(2):82-91.


DOI: 10.5281/zenodo.10637481


Received: 10 November 2023 
Revised: 15 January 2024 


Accepted: 30 January 2024 


Livio TARCHI¹*, Giovanni CASTELLINI², Valdo RICCA³

¹ MD, Psychiatry Unit, Department of Health Sciences, University of Florence, Florence, Italy. E-mail: livio.tarchi@unifi.it. ORCID: 0000-0002-9931-5621

² MD, Psychiatry Unit, Department of Health Sciences, University of Florence, Florence, Italy. E-mail: giovanni.castellini@unifi.it. ORCID: 0000-0003-1265-491X

³ MD, Psychiatry Unit, Department of Health Sciences, University of Florence, Florence, Italy. E-mail: valdo.ircca@unifi.it. ORCID: 0000-0002-9291-2124

*Correspondence 

Abstract 

Introduction: Psycholinguistics studies the relationship between linguistic behavior and
psychological factors. With regard to ADHD (Attention-Deficit/Hyperactivity Disorder)
and anorexia nervosa, psycholinguistics can offer insights into the language-related aspects of
these conditions. The current study aimed to describe the relationship between frequency rank
and frequency of use in written corpora retrieved from internet content.
Methods: Linguistic content was retrieved from online communities (Reddit). A total of 2,500 
unique contributions were collected for each community (ADHD, anorexia nervosa, general 
content), composed of an initial script and each subsequent elicited comment. Zipf's law was 
adopted to estimate the constant and exponential terms describing the relationship between 
word frequency rank and total usage. 
Results: A total of 17,853 and 10,049 written prompts were retrieved for ADHD and anorexia 
nervosa, respectively. This content was compared to 13,386 comments from a general 
community. The ADHD community exhibited a higher number of average comments per 
submission, while both ADHD and anorexia nervosa showed higher words per post than the 
general community. The exponent and constant terms of Zipf's law were also higher in the
ADHD and anorexia nervosa communities than in general content, suggesting a higher frequency of
rare terms in these communities. Notably, while general content emphasized etiquette and 
community moderation, ADHD and anorexia nervosa discussions predominantly revolved 
around psychopathological terms. 
Discussion: This study underscores the unique linguistic characteristics within these 
communities, shedding light on the interplay between linguistic behavior and psychological 
factors in the context of ADHD and anorexia nervosa. Psycholinguistic tools could aid in the 
development of a better understanding of the lived experience of individuals suffering from a
psychiatric disorder.
Take-home message: This study shows unique linguistic patterns in ADHD and anorexia
nervosa online discussions, emphasizing the role of psycholinguistics in understanding language use in
online media.

Keywords: communication; neurodivergence; internet use; quantitative linguistics; psycholinguistics; social media.


INTRODUCTION 

In language, as in other realms of life, emergent properties can originate from the need to allocate scarce resources [1], 
[2]. For instance, in communication, in light of cognitive constraints, a balance must be reached between effectiveness, speed, 
specificity, and parsimony [3]. Empirical laws attempt to describe these emergent properties in language. Among them, Zipf's
law states that the frequency of a word (in a written corpus of a natural language) is inversely proportional to its
occurrence rank [4]. In other words, the most common word occurs twice as often as the second most common word, three
times as often as the third, and so on. Early evidence showed that Zipf's law may not be maintained in speech or written 
forms of language in neurodevelopmental disorders, such as Autism Spectrum Disorder [5,6]. As another example of 
neurodevelopmental disorders, ADHD shows a similar onset and overlapping characteristics with Autism Spectrum 
Disorder [7,8], but no previous work has explored Zipf’s law in this clinical population. However, several empirical accounts 
described a high rate of language deficits in children with ADHD [7,9], warranting a higher focus on this topic. 

By contrast, Anorexia Nervosa (AN) has not been described as being associated with language deficits and does not show 
evidence in support of neurodevelopmental divergences [7]. Nonetheless, overlapping symptomatic characteristics have been 
observed between AN and Autism Spectrum Disorder [10]. Moreover, chronic calorie restriction is known to reduce 
cognitive functioning, but a complex mechanism might underlie differential longitudinal trajectories based on sex and onset 
of restriction [11]. In fact, animal studies offered evidence that late-onset restriction may worsen cognitive performance,
while conversely, exposure to calorie restriction during development might be protective and beneficial [11]. However, the 
effect of calorie restriction seems selective for specific functions [12] and may not encompass all cognitive domains [13]. 

General and specific psychopathology might interfere with the aforementioned allocation of scarce resources with regard
to language. Depression and social anxiety influence verbal expression through emotional status, interpersonal stances, and
cognitive processes [14-16]. In ADHD, specific psychopathology might interfere with the law describing the relationship
between rank and frequency of use for words in neurotypical linguistic corpora. Core symptoms such as inattention and
hyperactivity might tax writing or reading skills [17,18]. Additionally, complex phenomena such as hyperfocus or proficiency
in uncommon interests might stress vocabulary divergence [19-21]. To the authors’ knowledge, no study has observed 
language deficits in clinical samples of patients affected by Anorexia Nervosa. Yet, the scientific literature has focused on 
the social aspect of eating disorders [22], and a specific lexicon, morally driven and morally informed, might stress vocabulary 
divergence and use in patients or the general public interacting with them [23]. 

Aims and hypothesis 

For these reasons, a distinction between the contributions of general and specific psychopathology is needed
before natural language processing can inform clinical practice. The current study aimed to describe the relationship
between frequency rank and frequency of use in written corpora retrieved from internet content. Written corpora of online
communities of peers affected by ADHD and Anorexia Nervosa were compared with general content, not specific to
psychopathology.

The working hypothesis was that online communities devoted to specific psychopathologies would show an overinflation
of the most frequent words (those associated with conditions, needs, and symptoms) and less frequent use of singular vocabulary.
To control for these factors, the most distinct words for each community were computed and described.
METHODS 
Sample description and pre-processing 

The community of r/ADHD is an inclusive, disability-oriented support group of peers affected by ADHD. The 
community welcomes sharing stories, difficulties, and non-pharmaceutical strategies to cope with symptoms. The 
community has almost 1.5 million subscribers, with the highest growth between 2018 and 2021. The subreddit r/AnorexiaNervosa
describes itself as a safe place for those affected by the disorder or in recovery. It counts approximately 36,720
subscribers and had the highest growth between 2019 and 2021. By contrast, the board of r/ShowerThoughts is devoted to
sharing general wisdom gained during ordinary events. It mainly grew between 2014 and 2019 and counts over 25,126,800
subscribers to the present day.

These three Reddit communities were searched, and a random sample of 2,500 posts was selected for each subreddit. All
comments posted for each of the 2,500 submissions were retrieved, and a final written corpus of all posts and comments was
defined for each community. The sample size and procedure reflect previous work on analyzing mental health correlates on 
social media platforms like Reddit [24]. To avoid over-sampling lexicon associated with COVID-19, pandemics, infectious 
diseases, contagions, or symptoms [25], only submissions and comments posted between 1 January 2019 and 31 December 
2019 were included. All analyses were conducted in Python 3.8.13, with the support of the following libraries: “pandas” [26];
“numpy” [27]; “scipy” [28]; “matplotlib” [29]; “clean-text” [30]; “seaborn” [31]; “nltk” [32]; “gensim” [33]; “sklearn” [34].

Each community's written corpus was pre-processed to convert text to lower case and to remove punctuation, control characters, URLs,
line breaks, e-mails, phone numbers, numbers, digits, and currency symbols [30]. The overall written corpus was divided
into words, and each word was stemmed to retrieve its lexical root [32]. Stop words (e.g., “the”, “not”, “and”) were
removed [32,33]. Both stemming and stop-word removal were performed according to standard practices in natural language
processing [32].
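
The pre-processing steps described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' pipeline: the regular expressions, the tiny `STOP_WORDS` set, and the `preprocess` function are hypothetical stand-ins for the cleaning and stop-word resources cited in the text, and stemming is deliberately left out.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real analysis would rely on a full list from an NLP library.
STOP_WORDS = {"the", "not", "and", "a", "to", "of", "is", "it", "at", "or"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Anything that is not an ASCII lower-case letter or whitespace: digits,
# punctuation, currency symbols, control characters (ASCII-only by assumption).
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(text: str) -> List[str]:
    """Lower-case, strip URLs, e-mails and non-letters, split, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    tokens = text.split()  # also collapses line breaks and repeated spaces
    # A stemmer from an NLP library would be applied to each token here;
    # it is omitted to keep the sketch dependency-free.
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The meds cost $30, see https://example.com or mail me at a@b.org!")
print(tokens)
```

URLs and e-mail addresses are removed before the generic non-letter filter so that their fragments do not survive as spurious tokens.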

Empirical distribution of frequency and rank 

Zipf’s law is defined by the following: 

$$f(N) = \frac{C}{N^{Z}}$$

That is, the frequency of the N-th word is given by a power law with exponent Z. The standard Zipf’s law describes Z
= 1, while C is a constant that allows the distribution to be unitary. In order to fit Zipf's distribution to each corpus, the
Maximum Likelihood Estimation (MLE) of Z was derived according to procedures defined by Moreno-Sanchez and 
colleagues [35]. The goodness of fit was evaluated by computing each curve's Mean Absolute Error (MAE; [36]) and Mean 
Squared Log Error (MSLE; [34]). The MAE of a model is given by the sum of absolute differences between observed and predicted
values divided by the sample size [36]. Because the distributions follow a power law, standard measurements such as Mean Squared Error
(MSE) might be biased on the logarithmic scale. For these reasons, a variation of MSE, namely MSLE, which computes the
relative (percentage) difference between observed and predicted values, was preferred [34].
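
As an illustration of the fitting and evaluation steps, the following sketch estimates Z by maximum likelihood and then computes MAE and MSLE. It is not the procedure of reference [35]: it assumes a power law truncated to the observed ranks 1..N, maximizes the log-likelihood with a plain ternary search, and uses the log(1 + x) form of MSLE. All function names and the toy counts are hypothetical.

```python
import math
from typing import List, Sequence


def zipf_norm(z: float, n_ranks: int) -> float:
    """Normalizing sum of rank^-z over ranks 1..n_ranks."""
    return sum(r ** -z for r in range(1, n_ranks + 1))


def log_likelihood(z: float, counts: Sequence[int]) -> float:
    """Log-likelihood of exponent z; counts[0] belongs to the rank-1 word."""
    weighted_log_ranks = sum(c * math.log(r) for r, c in enumerate(counts, start=1))
    return -z * weighted_log_ranks - sum(counts) * math.log(zipf_norm(z, len(counts)))


def fit_exponent(counts: Sequence[int], lo: float = 0.01, hi: float = 5.0) -> float:
    """Ternary search for the maximum of the concave log-likelihood on [lo, hi]."""
    for _ in range(100):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, counts) < log_likelihood(m2, counts):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def predicted_counts(z: float, counts: Sequence[int]) -> List[float]:
    """Expected count per rank under the fitted law, scaled to the corpus size."""
    scale = sum(counts) / zipf_norm(z, len(counts))  # plays the role of C
    return [scale * r ** -z for r in range(1, len(counts) + 1)]


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute difference between observed and predicted counts."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def msle(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared difference of log(1 + count), a relative error measure."""
    errors = ((math.log1p(o) - math.log1p(p)) ** 2 for o, p in zip(observed, predicted))
    return sum(errors) / len(observed)


# Toy corpus: 200 word types whose counts follow a power law with Z = 1.2.
toy_counts = [round(10000 * r ** -1.2) for r in range(1, 201)]
z_hat = fit_exponent(toy_counts)
fitted = predicted_counts(z_hat, toy_counts)
print(f"Z = {z_hat:.3f}, MAE = {mae(toy_counts, fitted):.3f}, MSLE = {msle(toy_counts, fitted):.5f}")
```

On the toy counts the estimated exponent should land close to the generating value of 1.2. MAE is expressed in raw counts and is therefore driven by the most frequent words, whereas MSLE weights relative deviations more evenly across ranks.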
Most distinctively frequent words 

In order to characterize the most distinctively frequent words, the Term Frequency-Inverse Document Frequency (TF-IDF)
was adopted [32]. TF-IDF weights the number of times a term occurs in a document by the inverse of the number
of documents in which the same term appears. This procedure aims to reduce the weight of very frequent but not distinct
terms in a document. A selection of the top 10 words by TF-IDF score was chosen to represent the most distinct words for
each community.
RESULTS
Sample descriptives 

A total of 2,500 submissions were retrieved, as planned, for each of the three online communities described. Comments
posted for each submission were associated with the original body of text, and a total of 41,288 comments were analyzed. In particular,
the community of peers affected by ADHD had 17,853 comments and an average of 7.141 comments per submission. The
community of peers affected by Anorexia Nervosa had a total of 10,049 comments and an average of 4.020 comments per
submission. Finally, the general community, not specific to psychopathology, had a total of 13,386 comments and an average
of 5.354 comments per submission. Please see Table 1 for a tabular representation of the sample.


Table 1. Sample descriptives.

| Subreddit | Total number of comments | Average comments per submission | Words per post |
|---|---|---|---|
| r/ADHD | 17,853 | 7.141 | 27.649 |
| r/AnorexiaNervosa | 10,049 | 4.020 | 24.515 |
| r/ShowerThoughts | 13,386 | 5.354 | 11.108 |

Note: All subreddits were sampled for 2,500 submissions, with content posted from 1 January 2019 to 31 December 2019. Words per post given by both submissions and comments.
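
The averages in Table 1 follow directly from dividing each comment total by the 2,500 sampled submissions. A short check of that arithmetic is given below; the variable names are illustrative only.

```python
N_SUBMISSIONS = 2500  # submissions sampled per subreddit

# Comment totals as reported in Table 1.
comments = {
    "r/ADHD": 17853,
    "r/AnorexiaNervosa": 10049,
    "r/ShowerThoughts": 13386,
}

# Average comments per submission, rounded to three decimals as in the table.
averages = {name: round(total / N_SUBMISSIONS, 3) for name, total in comments.items()}
total_comments = sum(comments.values())

print(averages)
print(total_comments)
```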


Empirical distributions of words

The frequency of each unique word was calculated for each online community. ADHD was the community with the 
highest total words (493,616) but the lowest prevalence of unique words in the sample (4.87%). Anorexia Nervosa, on the 
other hand, had a total of 246,348 words and an intermediate prevalence of unique words (6.29%) between ADHD and 
general content. General content, in fact, had the lowest number of total words (148,696) but the highest prevalence of 
unique words (10.21%). Moreover, a gradient in the goodness of fit for Zipf's law was observed. The community of peers 
with ADHD showed the worst goodness of fit (MAE 22.056; MSLE 3.633), Anorexia Nervosa an intermediate goodness
of fit (MAE 18.872; MSLE 3.477), and the general community the best (MAE 10.658; MSLE 2.624). 


Table 2. Frequency of unique words, empirical distributions.

| Subreddit | Total words | Unique words | Z exponent | C constant | MAE | MSLE |
|---|---|---|---|---|---|---|
| r/ADHD | 493,616 | 27,037 (4.87%) | 1.260 | 4.169 | 22.056 | 3.633 |
| r/AnorexiaNervosa | 246,348 | 15,514 (6.29%) | 1.241 | 4.336 | 18.872 | 3.477 |
| r/ShowerThoughts | 148,696 | 15,187 (10.21%) | 1.055 | 8.027 | 10.658 | 2.624 |

Note: Prevalence of unique words is given by dividing unique words by total words. MAE: Mean Absolute Error; MSLE: Mean Squared Log Error.


The Z exponent and C constant of Zipf's law were fit according to MLE for each written corpus. The most used words
were less frequent than expected across all communities. However, a higher frequency of rare words (the right-hand side of
the distribution) was observed only in communities of peers affected by ADHD or AN. The divergence from the theoretical
distribution (best fitted according to MLE as described above) concerned words ranked 1,000th or lower. The results can be
found in Figure 1.

Figure 1. Relationship between frequency and rank of each word, theoretical and empirical distributions.


Most distinct words 
The 10 most distinct words per community are represented in Table 3. Both the communities of peers affected by ADHD
and AN were distinctively represented by the words “time” and “people”. The ADHD community was represented
mainly by its specific term (“ADHD”, TF-IDF 0.289), as was the community of peers affected by AN (“eating”, TF-IDF 0.210; “eat”,
TF-IDF 0.195; “weight”, TF-IDF 0.212). By contrast, the general community of r/ShowerThoughts was distinctively
represented by words pertaining to the realm of etiquette and moderation in the community (“removed”, “please”,
“frequently asked questions, FAQ”, “rules”, “moderators”).

Table 3. Distinct words per community.

| r/ADHD word | TF-IDF | r/AnorexiaNervosa word | TF-IDF | r/ShowerThoughts word | TF-IDF |
|---|---|---|---|---|---|
| ADHD | 0.289 | Feel | 0.236 | Please | 0.313 |
| Get | 0.259 | Know | 0.226 | Removed | 0.237 |
| Time | 0.206 | Weight | 0.212 | FAQ | 0.233 |
| One | 0.154 | Eating | 0.210 | Rules | 0.227 |
| Really | 0.152 | Get | 0.198 | Reddit | 0.224 |
| Know | 0.151 | Eat | 0.195 | Wiki | 0.208 |
| Things | 0.148 | Really | 0.165 | Moderators | 0.202 |
| People | 0.148 | Want | 0.158 | Google | 0.174 |
| Feel | 0.142 | Time | 0.146 | Games | 0.174 |
| Work | 0.132 | People | 0.144 | Automatically | 0.161 |

Note: TF-IDF: Term Frequency-Inverse Document Frequency.
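
The kind of weighting behind the scores in Table 3 can be sketched as follows. This is a minimal, unsmoothed variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and an assumption of this example. Library implementations typically add smoothing and normalization, so it will not reproduce the reported values. The `tf_idf` function, the toy documents, and the choice of one token list per document are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} mapping per (non-empty) tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        scores.append({
            # relative term frequency * log inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


# Toy documents (already cleaned and tokenized), one per hypothetical community.
docs = [
    ["adhd", "time", "people", "adhd"],
    ["eat", "weight", "time", "people"],
    ["please", "rules", "removed", "please"],
]
scores = tf_idf(docs)
top_terms = [max(doc_scores, key=doc_scores.get) for doc_scores in scores]
print(top_terms)
```

Terms shared across documents (here "time" and "people") are down-weighted relative to terms concentrated in a single document, which is why community-specific vocabulary rises to the top of each ranking.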


DISCUSSION 

The current study aimed to explore the linguistic patterns in online communities of peers affected by ADHD and
Anorexia Nervosa (AN) using Zipf's law as a framework. Contrary to the working hypothesis, the findings indicated no
overinflation of the most frequent words in either community. Instead, a higher frequency of unique, less-used words was
observed. This divergence from expected linguistic patterns under Zipf’s law may reflect the unique communicative needs 
and experiences of individuals in these groups. 

In the ADHD community, the prevalence of unique words could be linked to the diverse and idiosyncratic nature of the 
disorder [37]. ADHD's core symptoms, like inattention and hyperactivity, might lead to a broader, more varied lexical choice 
as individuals seek to articulate their distinct experiences and coping strategies. This is further supported by the presence of 
words like "ADHD," "time," and "people" among the most distinct words, highlighting the community's focus on personal 
experiences and social interactions. 

Similarly, the AN community displayed a unique lexical pattern, potentially reflecting the complex emotional and physical
experiences associated with eating disorders [38-40]. Words like "eating," "weight," and "eat" were prominent, indicating a
focused discussion on specific aspects of the disorder. The high prevalence of unique words in this community might indicate
the nuanced and personal nature of discussing and coping with AN and sharing diverse personal stories and recovery 
journeys [41,42]. 

The current descriptive analysis using Zipf's law revealed that the ADHD community had a lower prevalence of unique
words compared to the AN community. Still, both deviated from the general content regarding word frequency distribution.
This suggests that while both communities tend towards lexical diversity, the nature and context of their discussions differ
significantly.

The TF-IDF analysis further illuminated these differences, revealing the most distinct words in each community. This 
analysis provides a deeper understanding of each group's thematic focuses and linguistic peculiarities [43,44]. 
Future directions 

The findings of this study open new avenues for future research. A deeper exploration into the psycholinguistic patterns 
of other neurodivergent communities could provide valuable insights into the unique ways these groups use language. 
Additionally, longitudinal studies could track changes in language use over time, particularly as members of these 
communities evolve in their understanding and experiences of their conditions. 
Study limitations 

This study, however, is not without limitations. Firstly, the clinical samples are divergent and lack an objective definition 
of psychopathology or confirmation of diagnosis. This raises questions about the generalizability of our findings to all 
individuals with ADHD or AN. Furthermore, relying on online written text may not fully capture the complexities of 
language use in these communities, as online communication can significantly differ from other forms of expression. 
CONCLUSIONS 

Contrary to the working hypothesis, neither of the two online communities of peers showed an overinflation of the most 
frequent words. By contrast, a higher frequency of unique, less-used words was observed in both. A divergence from the 
content retrieved by general communities was present in the online discourse of patients with ADHD and AN. 

In conclusion, our study reveals intriguing linguistic patterns in online communities of peers affected by ADHD and AN,
challenging the traditional understanding of Zipf's law in the context of neurodivergence. These findings underscore the
importance of considering unique linguistic expressions in psychopathological research and highlight the rich,
diverse nature of language use in neurodivergent communities.